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We present a new method to extract distance and orientation dependent potentials between amino 
acid side chains using a database of protein structures and the standard Boltzmann device. The 
importance of orientation dependent interactions is first established by computing orientational or- 
der parameters for proteins with a-helical and /3-sheet architecture. Extraction of the anisotropic 
interactions requires defining local reference frames for each amino acid that uniquely determine 
the coordinates of the neighboring residues. Using the local reference frames and histograms of the 
radial and angular correlation functions for a standard set of non-homologue protein structures, we 
construct the anisotropic pair potentials. The performance of the orientation dependent potentials 
was studied using a large database of decoy proteins. The results demonstrate that the new dis- 
tance and orientation dependent residue-residue potentials present a significantly improved ability 
to recognize native folds from a set of native and decoy protein structures. 

Keywords: statistical coarse-grained potentials, side chain orientation, Boltzmann device, protein structure 
prediction, side chain packing, contact order 



I. INTRODUCTION 

A major goal of proteomics is to determine structures 
of proteins rapidly. It is impractical to obtain very high 
resolution structures on a genome wide scale. Further- 
more, for processing biological functions it is important 
to characterize the dynamics and thermodynamics of a 
network of proteins. From a computational perspective, 
it is currently impossible to realize these goals using all- 
atom molecular dynamics simulations^. Thus, there 
is an urgent need to develop coarse-grained, yet reli- 
able, models for protein structures. Construction of such 
models requires the determination of interaction poten- 
tials between amino acid residues. The wealth of struc- 
tural data on a number of proteins in the Protein Data 
Bank (PDB) 3 has been a source for obtaining interaction 
potentials"^. 

The idea of using the frequencies of amino acid pairing 
to determine potential interaction parameters was first 
proposed by Tanaka and Scheragai With the exception 
of a few studies^, most of the "knowledge-based" poten- 
tials have been obtained solely in terms of residue- residue 
contacts. Sippl^ and others have introduced an explicit 
distance dependence in the database-derived mean force 
potentials using the Boltzmann formula. This method, 
known as the "Boltzmann device", assumes that the 
known protein structures from the PDB correspond to 
classical equilibrium states. The side chain - side chain 
(SC-SC) potentials can be related to distance-dependent 
probability densities f(r) by the relation 



U' l3 {r) = -kT In 



f ij (r) 



(1) 



Jref(r) 

where p 3 (r) is the probability density for a side chain of 



type i to be separated by a distance r from a side chain 
of type j, and the choice of the reference probability den- 
sity f re f( r ) is very important**^. SippliS suggests that 
r can be the distance between two atoms or some other 
structural parameter such as dihedral angles. Other sta- 
tistical potentials developed subsequently using this ap- 
proach have interpreted f(r) as the distance-dependent 
probability density. 

In recent years, a number of studies have evaluated 
the goodness of statistical potentials based on pairwise 
additive interactions between residues2iiiii2iI2iA£i± 5 *i£ii£. 
These studies have concluded that pairwise additive po- 
tentials, dependent only on the radial distance between 
residues, are inadequate for structure prediction. One of 
the major drawbacks of using contact potentials or po- 
tentials that only depend on the radial distance is that 
the relative orientation between the amino acids, which 
plays a role in the packing of the protein interior, is not 
taken into account. To account for the dense interior of 
the folded states of proteins it is crucial to optimize the 
relative orientation of the participating side chains^. It 
is the purpose of this paper to examine the relevance of 
interaction potentials between amino acids that include 
the orientational dependence. 

By analyzing the angular distribution of side chains 
around amino acids, Bahar and Jernigan 19 showed that 
some residue pairs have specific coordination states with 
much higher probabilities than those expected from ran- 
dom distributions. Another indication that the relative 
residue-residue orientations are important is the recent 
success of the united-residue force field (UNRES) devel- 
oped by Scheraga and coworkero 20 : 21 : 22 . The SC-SC in- 
teractions of the UNRES potentials are parameterized 
as van der Waals potentials with a pair-specific angular 
dependence included in the well depth. The UNRES po- 
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tentials employ a generalized Gay-Berne potentialSi 2 *^ 
that assumes that the interacting sites are ellipsoids of 
revolution placed at the center of mass of the side chains. 
Although the relative side chain orientations are de- 
scribed in a simplified manner, the success of the UN- 
RES potentials demonstrates the importance of includ- 
ing orientational dependence in the side chain interaction 
potential. 

This idea is also supported by recent studies of protein 
side chain packing^ which found that current potentials 
can accommodate more than one rotamer for 95% of side 
chain positions. These studies suggest that interaction 
potentials that can discriminate between a large number 
of competing rotamer states are required. 

The main goals of this paper are (a) to extract statis- 
tical information about the relative residue-residue ori- 
entations and distances in proteins using the available 
high resolution PDB structures and (b) to investigate 
the utility of the orientational information in enhancing 
the ability of distance dependent potentials to identify 
native protein folds. These goals can contribute to a 
better understanding of the specific residue packing of 
native protein structures. Motivated by the above men- 
tioned results of Jernigan et and Scheraga et ali^S, 
we approached the first goal by defining local reference 
frames (LRFs) that permit a more precise, quantitative 
description of the relative orientation of a given pair of 
side chains. The new LRFs are related more closely to 
the three-dimensional configuration of the individual side 
chains, rather than to their relative position with respect 
to the backbone or to the neighboring residues. We could 
extract thus novel radial and angular dependent inter- 
residue statistical potentials that reflect the specific dis- 
tributions of distances and relative SC-SC orientations as 
observed in native structures of real proteins. In order to 
address our second goal we tested the efficacy of the new 
potentials by using a large database of incorrect models 
of real proteins2£i2&, known as decoys. These decoys are 
computer-generated models of protein structures, specifi- 
cally designed for being used in evaluating the capacity of 
various potential functions to distinguish the native-like 
conformations from non-native ones. 

Before presenting our methods and the results that 
we obtained, a few clarifications are necessary. First, a 
complete set of potentials for coarse-grained protein fold- 
ing simulations should take into account interactions be- 
tween side chains, interactions between side chains and 
peptide groups^, and energetics dependent on torsional 
angles that define the backbone structur o 20 i 21 i 22 . We 
only investigate the specific features of SC-SC interac- 
tions, as has been done before in studies of statistical 
potentials that are only contact or distance dependent. 
Secondly, we constructed our own corresponding set of 
distance-dependent potentials rather than employing po- 
tentials constructed by other authors. In this manner 
we minimized the possibility that a better parameter 
fitting, specific computational implementation, or other 
technical aspects could affect our results. Finally, as in 



similar studies^ we do not directly address the issue of 
the dependence of the results on the training database 
size or sampling problems. Instead, we used a stan- 
dard, reproducible approach, by employing the set of 
non-homologous proteins that was used by Scheraga et 
a j f 20 i 2i i 22 £ or s i m ii ar purposes. 



II. METHODS 

The relative residue-residue orientations are thought to 
be directly related to the nature of the forces that shape 
the specific three-dimensional structures of proteinsi2*2ii. 
However, a quantitative approach to the statistical ex- 
traction of these orientational information from high res- 
olution structures is still needed. The necessity of in- 
cluding orientation dependent interactions is established 
in section III Al by computing correlation functions that 
probe orientational order in the PDB structures. It is 
shown that, for a given native state topology, the orien- 
tational packing of side chains may be decomposed into 
a linear combination of simple cluster geometries. How- 
ever, the use of simple orientational order parameters is 
not sufficient for discriminating between basic protein 
architectures such as a-helix or /3-sheet. As such, we are 
motivated to use more detailed quantitative descriptions 
of relative residue orientations. In order to achieve this 
goal, in section lll Bl we introduce definitions of amino acid 
dependent local reference frames (LRFs) that permit a 
standard description of the relative SC-SC orientations. 
A method for extracting the radial and angular depen- 
dent pair distribution functions is presented in section 
III CI emphasizing the limits imposed on the statistical 
analysis by the accuracy of the available experimental 
database of protein structures. 



A. Measures of Orientational Order. The 
Dependence of Anisotropy on Native State Topology 

The specificity of SC-SC contacts suggests that the rel- 
ative residue orientations should play a significant role in 
determining packing in proteins. This idea, supported 
by the study of Bahar and JerniganiS, builds up on the 
more general theme of how side chains pack in the native 
states of proteinsi2i2&. The observation that the interior 
of protein structures is densely packed2L22i2ii raises the 
question if one can use a simple crystal lattice descrip- 
tion (e.g. sc, bec, etc.) for the side chain packing. While 
there are many qualitative descriptions of residue pack- 
ing, they are mainly based on the notion that hydropho- 
bic interactions are responsible for globular shapes^ and 
that specific polar interactions and hydrogen bonding 
are important stabilizing factors. Because these stud- 
ies do not address quantitatively the issue of relative side 
chain orientations, we used orientational order parame- 
ters (OOPs)2i to asses their importance. 

The Orientational Order Parameters (OOPs) were 
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introduced 3 ^ to analyze the internal structure of liquids 
and glasses and have been used as reaction coordinates 
for studying the structure of a nucleating system2*li2£i21. 
The OOPs are defined in terms of "bonds" that connect 
a central particle to its neighbors. For each "bond", one 
can compute the corresponding values of the spherical 
harmonic functions Yi m {6,(jj) that can be used as local 
order parameters 



Q, m (r) =y, m (0(r),0(r)) 



(2) 



where r is the position vector of the central particle and 
9(r) and <p(r) are the polar and azimuthal angles of vector 
r with respect to any reference frame (see, e.gi^). To 
make the Qi m values representative for describing the 
orientational order of an entire system of particles, the 
global orientational parameters are defined as 



lira 



1 N b 
i—1 



(3) 



where Nb is the total number of bonds in the system. 

Both Qi m and Qi m depend on the choice of local ref- 
erence frame in which the Y/ m values are calculated. 
Second- and third-order rotationally invariant combina- 
tions can be constructed using 
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mi+m 2 +m 3 -0 \ m l m 2 ™>3 



Qlmi Qlrii2 Ql 



(5) 



where the term 



I I I 
mi m,2 m-T, 
symbol 34 . In liquids, the ratio 



is the Wigner-3j 



Wi = Wi/ 



3/2 



(6) 



is not sensitive to the exact definition of neighboring 
bonds of a particle. The orientational order parameters 
Qi and Wi, together with the reduced order parameter 
Wi have specific values for a number of simple cluster 
geometries such as face-centered-cubic (fee), hexagonal 
close-packed (hep), body-centered-cubic (bec), simple cu- 
bic (sc) and icosahedral (icos). If only the spherical term 
(/ = 0) dominates, then it is clear that orientational po- 
tentials are not expected to be important. The emergence 
of non-zero values of Qi and Wi not only signify that side 
chain orientations are important in dense packing but 
may also point to the nature of orientational ordering. 

All the low-order values (I < 10) of the parameters 
Qi , Wi and Wi are sensitive to the structural details that 
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FIG. 1: (a) The values of the Qi, Qe, Wi and We orienta- 
tional order parameters for simple cluster geometries—. Their 
specific differences permit a quantitative analysis of the inter- 
nal order in atomic and molecular systems, (b) The values 
of the Qi, Qe, Wi and We orientational order parameters es- 
timated for three different sets of proteins with Q-helical and 
/3-strand architectures (columns). The three rows correspond 
to averages calculated for all the types of side chains (top), 
for the five most hydrophobic side chains only (middle), and 
for the five most hydrophilic residues (bottom). The values 
of Wi and We were multiplied by 10. 



are specific to the simple cluster geometries mentioned 
above 54 " 55 . However, for practical reasons, monitoring 
the values of Q4, Q$, W4 and W$ suffices to discriminate 
between the different cluster geometries 3 ^!— The values 
of these four OOPs (Fig. serve as a reference to 

which the values computed using PDB structures can be 
compared. Due to scale differences between Qi and Wi, 
the values of W4 and were multiplied by 10. 

We calculated the Q4, Q 6 , W4 and We parameters 
(shown in Fig. (IJa) for three sets of 50 proteins from 
the same family with a-helical and /3-strand architectures 
listed in Table |U For each set, we have specifically se- 
lected homologue structures to maximize the possibility 
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TABLE I: The three sets of protein structures used to investi- 
gate if a simple packing description is appropriate for protein 
residues. Highly homologue sequences corresponding to alpha 
and beta architectures were specifically selected. 



Set 1 - 


- Ig(/3) 


Set 2 - 


Hb(al) 


Set 3 - 


- Mb(a2) 


12E8 


1CFV 


1A00 


1CG5 


101M 


1CH1 


1A2Y 


1CLO 


1A01 


1CG8 


102M 


1CH2 


1A3L 


1DCL 


1A0U 


1CLS 


103M 


1CH3 


1A3R 


1DSF 


1A0V 


1DSH 


104M 


1CH5 


1A4J 


1DVF 


1A0W 


1DXT 


105M 


1CH7 


1A6V 


1EUR 


1A0X 


1DXU 


106M 


1CH9 


1A6W 


1EUS 


1A0Y 


1DXV 


107M 


1CIK 


1A7N 


1F58 


1A0Z 


1ECA 


109M 


1CIO 


1A70 


1FGV 


1A3N 


1ECD 


110M 


1C08 


1A7P 


1FLR 


1A30 


1ECN 


111M 


1C09 


1A7Q 


1FLT 


1A4F 


1ECO 


112M 


1CP0 


1A7R 


1FVC 


1AJ9 


1FLP 


1A6G 


1CP5 


1AJ7 


1GIG 


1ASH 


1FSL 


1A6K 


1CPW 


1AP2 


1GPO 


1AXF 


1GBU 


1A6M 


1EMY 


1AQK 


1HCV 


1B2V 


1GBV 


1A6N 


1FCS 


1AXT 


1HIL 


1BAB 


1GDI 


1ABS 


1HJT 


1AY1 


1HYX 


1BBB 


1GDJ 


1AJG 


1HRM 


1B0W 


1HYY 


1BIJ 


1GDK 


1AJH 


1HSY 


1BFV 


HAM 


1BIN 


1GDL 


1AZI 


HOP 


1BJM 


1IGD 


1BUW 


1HAB 


1BJE 


1IRC 


1BM3 


1IGM 


1BZ0 


1HBA 


1BVC 


1JDO 


1BRE 


1IND 


1BZ1 


1HBB 


1BVD 


1LHS 


1BWW 


1IVL 


1BZZ 


1HBG 


1BZ6 


1LHT 


1CDY 


1KEL 


1CBL 


1HBH 


1BZP 


1LTW 


1CFB 


1KEM 


1CBM 


1HBI 


1BZR 


1M6C 



of finding OOPs that have different, characteristic values 
for different protein architectures. If similar OOP val- 
ues were obtained for the homologue sets of hemoglobins 
(Hb) and myoglobins (Mb), yet different from the ones 
measured for immunoglobulins (Ig), it would be a strong 
indication that the OOPs are a sensitive measure of pro- 
tein architecture. As shown next, this is not the case 
and, therefore, the more detailed investigation of the rel- 
ative orientations of side chains is necessary. The three 
rows in Fig. ^ correspond to averages calculated: (a) 
for all the types of side chains (top), (b) only for five 
highly hydrophobic side chains (middle), and (c) for five 
hydrophilic residues (bottom). The hydrophobic side 
chains considered were He, Leu, Val, Phe and Met, while 
the polar residues were Asn, Gin, Ser, Thr and His. In 
Fig. the columns correspond to the three sets of pro- 
teins that were analyzed. The first set had fifty differ- 
ent immunoglobulin structures with /3-sheet architecture 
(left), the second set consisted of fifty hemoglobins with 
a-helical structures (middle) and, finally, the third set 
comprised fifty different myoglobin structures which also 
have a-helical architecture (right). 



While the computation of Qi and Wi in liquids is 
straight forward, for their correct calculation in proteins 
one needs to consider the following points: (1) The size 
of the protein must be large enough to obtain meaningful 
average values. Because helices or /3-sheets do not have 
"interior" points, the notion of packing itself is meaning- 
ful only for tertiary structures. Hence, the proteins must 
contain enough tertiary "structure"—; (2) A meaningful 
cut-off distance must be selected in the identification of 
near neighbors for a given side chain. The three protein 
data sets^ used in this study are sufficiently large that 
Qi and Wi can be easily computed to assess the degree 
of anisotropy in the relative residue-residue orientations. 
We used a neighbor "cut-off" distance that is 1.2 times 
the value of the position of the first peak in the radial 
distribution function for each structure. This definition 
ensured that all the side chains in the first coordination 
shell were counted as neighbors^. In the calculations 
presented here the centers of the residues were taken to 
be the geometric centers of the heavy atoms in the side 
chains. 

Comparison of the results in Fig. ^ an d Fig. 2J) shows 
that a simple "standard" crystallographic type of order 
is absent in proteins. This shows that the interior of pro- 
teins does not have a simple point group symmetry. We 
calculated the values of the correlation coefficients be- 
tween the OOPs presented in Fig. and the ones in 
Fig. These calculations show that a simple "stan- 
dard" crystallographic type of order is absent in pro- 
teins. The interior of proteins does not have a simple 
point group symmetry which would have permitted the 
development of a simple, quantitative description of side 
chains packing. In Fig. [21 are shown the correlation co- 
efficients between the theoretical OOPs values (Fig. [IJi) 
and the OOPs (Fig. QJj) calculated for the three sets of 
proteins with a-helical and /3-strand structures (Table UJ) 
. The absolute magnitudes of the correlation coefficients 
are not very meaningful when comparing small data sets, 
but their relative values can give us a quantitative indica- 
tion if the order parameter values are closer to one type 
of cluster geometry (e.g. fee) than to another (e.g. icos). 
The results are shown in Fig. by averaging: (a) over 
all the residue types (top, all), (b) over the five most 
hydrophobic residues (middle, H) and (c) over the five 
most strongly polar side chain types (bottom, P). Please 
note that the all data represents averages computed over 
all 20 amino acids, while the H and P sets correspond to 
only 5 amino acids each. Therefore, the results for the all 
set are not necessarily averages of the corresponding H 
and P values. The three curves in each figure correspond 
to correlation coefficients calculated for the immunoglob- 
ulins with /3-sheet architecture (circles), for the set of 
hemoglobins with a-helical structures (squares), and for 
the a-helical myoglobins (triangles). A different plot of 
the same correlation coefficients is also shown in Fig. |2t> 
for the values calculated for: (a) the set of immunoglobu- 
lins (top, /3), (b) the set of hemoglobins (middle, «i), and 
(c) the set of myoglobins (bottom, 02). The three curves 
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FIG. 2: (a) Correlation coefficients between theoretical orien- 
tational order parameters (OOPs) calculated for simple clus- 
ter geometries and OOPs calculated for the three protein sets 
presented in Table The results are shown by averaging: (i) 
over all the residue types (top, all), (ii) over five most hy- 
drophobic residues (middle, H) and (iii) over the five most 
strongly polar side chain types (bottom, P). The three curves 
in each figure correspond to correlation coefficients calculated 
for the first set of /3-strand immunoglobulins (circles), for the 
second set of hemoglobins with prevalent a-helical structures 
(squares), and for the third set of myoglobins with preva- 
lent Q-helical structures (triangles), (b) Same data as in 
Fig. Here, the results are shown for the values calcu- 

lated for: (i) the set of immunoglobulins (top, 0), (ii) the 
set of hemoglobins (middle, ai), and (iii) the set of myo- 
globins (bottom, ai). In each figure, the three curves corre- 
spond to correlation coefficients calculated by averaging over 
all the residue types (circles), over the five most hydrophobic 
residues (squares), and over the five most strongly polar side 
chain types (triangles). 



correspond for each figure to correlation coefficients cal- 
culated by averaging over all the residue types (circles), 
over the five most hydrophobic residues (squares), and 
over the five most strongly polar side chain types (trian- 
gles). 

The analysis of the results in Fig. pleads to the fol- 
lowing conclusions: (1) There is no clear trend towards 
a single geometry associated to side chain packing. Nev- 
ertheless, we observe significant values for Q4 and Qq 
that suggests that high order orientational correlations 
characteristic of fee, bee, or icosahedral (icos) clusters 
(or a combination of these) are observed in proteins with 
a-helical and /3-sheet architecture. (2) While there are 
strong correlations indicating an icosahedral packing of 
residues (most evident for polar residues in Fig. [2^, bot- 
tom) , we also find significant contributions from the other 
types of cluster geometries. For example, for both (3- 
strand (Fig. \Bp top) and a-helical (Fig. \Bp middle) archi- 
tectures, both the hep and sc geometries make significant 
contributions. This is in accord with the observations of 
Bagci et ali^ who showed that there is a general uniform 
distribution of residues in proteins, two-thirds being ap- 
proximately fee packed and one-third occupying random 
positions. We also observed high correlations between 
the fee packing and the orientational order of both hy- 
drophobic and strongly polar residues in some a-helical 
molecular structures like in myoglobins (Fig. |5p, middle 
and bottom and Fig. \Bp bottom), but there is also a 
high correlation for the icosahedral character of packing 
in those cases. (3) There is a strong similarity in the dis- 
tribution of OOPs between the /3-sheet immunoglobulin 
structures and the a-helical hemoglobins (Fig. \Bp) . How- 
ever, there is considerable difference in the orientational 
order between the two a-helical proteins (hemoglobins 
and myoglobins). This suggests that even within a given 
fold there can be variations in the precise orientational 
registry of the side chains. Similarly, it appears that 
proteins with vastly different architecture can have sim- 
ilar orientational order. A more detailed investigation 
of orientational order in a large data set of proteins is 
warranted. 

The results presented here show that the bond- 
orientational order parameters are useful quantitative 
tools in investigating the orientational order of side 
chains in proteins. They also demonstrate that, for a 
given native state topology, the orientational packing of 
side chains may be decomposed into a linear combination 
of simple cluster geometries. The importance of orienta- 
tional degrees of freedom in the packing of side chains re- 
inforces the need for deciphering anisotropic inter-residue 
potentials from PDB structures. A method to extract 
such potentials is described next. 



B. The Local Reference Frames for Amino Acids 



As demonstrated in the previous sections, we need to 
define local reference frames (LRFs) for all the types of 
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He-He 



lle-Arg 
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X 





FIG. 3: The definition of a local reference frame (LRF) for 
residue-residue interactions. The local Oz axis is defined by 
the positions of the C a and Cp atoms and the local Ox axis 
is defined by the positions of the Cp and C 7 atoms. In all 
the cases, the local reference frames should be right handed. 
The interaction centers S* are defined as the centers of mass 
of the heavy atoms in the side-chains. The details are given 
in the text. 



protein side chains in order to build a quantitative de- 
scription of their relative orientations. The definition of 
LRFs for SC-SC interactions is illustrated in Fig. |3] The 
local Oz axis is denned by the positions of the C a and 
Cp atoms. The local Ox axis is defined by the positions 
of the Cp and C 7 atoms. The right handed nature of the 
LRFs determines automatically the direction of the Oy 
axis if Ox and Oz are known. Once the local axes are 
defined, the polar angles 9 and <j) and the inter-residue 
distances r can be used as internal degrees of freedom 
that describe the relative residue-residue orientations as 
shown in Fig. |31 These definitions are altered for the 
following special cases: (I) Gly does not have Cp and 
the local Oz axis is defined by the bisector of the angle 
defined by N\ C l a and C\ (2) both Gly and Ala do not 
have C 7 and the local Ox axis is defined as parallel to 
the direction defined by the backbone atoms N l and C l , 
(3) for Cys and Ser the corresponding coordinates of the 
S and O atoms are substituted for the coordinates of the 
missing C 7 , and (4) the coordinates of the midpoint be- 
tween the two C 7 atoms are used to define the direction 
of the Ox axis for He and Val. For each amino acid the 
interaction center (S l ) is defined as the center of mass 
of the heavy atoms in the side chain with the exception 
of Gly for which the position of S l coincides with C a . 




Arg-lle 



Arg-Arg 














Jfl 








FIG. 4: Orientational probability density maps for Ile-Ile, 
lle-Arg, Arg-lle and Arg-Arg interactions. The probability 
amplitudes correspond to the scale shown in the scale bar, in 
units of 10 -3 . Extreme cases of highly hydrophobic (He) and 
the highly hydrophilic (Arg) residue are chosen to investigate 
if hydropathic properties are reflected in maps. Sample local 
reference frames (LRFs) for He and Arg are also depicted. 
The origins of the LRFs are placed in the centers of mass of 
the heavy atoms in the respective side-chains. 



Sample LRFs for He and Val are also shown in Fig. 



C. Finding the structure-derived potentials 

From the definition of the LRFs, it follows that the 
SC-SC interaction potentials depend on five independent 
parameters that define the relative positions of two dis- 
tant side-chains i and j: ry, <j)%j, Oij, 4>ji and 9ji. We 
assume that these are the only geometrical factors that 
influence the two-body interactions between two residues 
i and j Qj — i\ > 4). One more independent parameter 
(a torsional angle around ) can be used for a complete 
description of the relative orientations of the two residues 
in three dimensions, but it is not employed in this work 
because its influence is expected to be important only 
for short range interactions. Accounting for this sixth 
angular parameter would also increase dramatically the 
statistical requirements for the protein database that is 
analyzed. Assuming pairwise interactions, the distance 
and orientation dependent potential for the residue pair 
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ij is 



(7) 



We further assume that U^q can be decomposed as 

U DO i ri 3 ' ^ij ' > ^ji ) = ^ DO i ri 3 ' ' ®ij ) 

+^o(^'i.^i.%) (8) 
We use the Z7do notation for the statistical poten- 
tials that are both distance and orientation dependent, 
and the Ud notation for potentials that depend solely on 
inter- residue distances. The assumption in Eq. [8] on the 
separability of potentials is not always valid. For a sys- 
tem of interacting side chain pairs, described by a Boltz- 
mann equilibrium, Eq. |HJis consistent, however, with the 
probabilistic relation 



P total ' &j ' % ' ' Qji) 



xl»'(r,,.o „.!),,) (9) 



where P? otal (rij , 4>ij , 



ij, (f>ji, Oji) is the probability to find 
a pair of interacting side chains i and j separated by a 
distance — rji between their interaction centers, and 
with relative orientations given by the set of (4>ij, Qij) an- 
gles in the LRFi frame, and by the set of (<fiji, Oji) angles 
in the LRFj frame (see Fig. [3J. Eq. [5] implies, there- 
fore, that the relative orientations of the local reference 
frames LRFi and LRFj of two interacting side chains i 
and j do not depend on each other. As suggested by 
previous studiesi2*i2i2£ j this type of independence could 
be expected for side chains that are separated by a large 
enough number of peptide bonds along the backbone. 
This assumption is reasonable in our analysis because 
only side chains that are separated by at least five peptide 
bonds along the protein backbone are considered. This 
corresponds to residues that are found on a "topological 
level" k > A Besides this constraint, for simplicity, we 
are also considering (as in 30 ) that the SC-SC interactions 
are independent on sequence separation (i.e. for k > 5, 
all side chains are on the same topological level). 



The Boltzmann Device 



for the distance and orientation-dependent distributions. 
To be consistent with previous studies, we consider the 
reference pair distribution functions P re f to be the cor- 
responding radial or angular pair distributions that are 
obtained through an analysis of all twenty residue types. 
A database of non homologous proteins can be used to 
estimate the pair distributions and to extract amino acid 
specific interaction potentials that are consistent with a 
large set of protein structures. 

An important issue that appears when using probabil- 
ity density functions with the Boltzmann device for con- 
structing statistical potentials is "the problem of small 
data sets" . As noted by SipplA, dividing the SC-SC pair 
frequencies by both side chain type and distance inter- 
vals results in situations when the available data is too 
small for conventional statistical procedures. This prob- 
lem was solved by Sippl by proposing a "sparse data 
correction" formula that builds the correct probability 
densities as linear combinations between the measured 
data and the reference, total probability densities ob- 
tained by averaging over all twenty SC types. For the 
general, orientation-dependent probability densities the 
sparse data correction can be written as 



1 



1 + m'a 



—P re f{r,(j>, 



+ —^PHrA,0) (12) 
1 + ma 



where P IJ are the actual probability densities obtained 
from the database for the ij pair of side chains, P% Tr 
are the corrected probabilities, and P re t is the reference 
probability density. A modification introduced by the 
orientational dependence in our case is that the num- 
ber of measurements m becomes m! = m/ sin(9k), as k 
equiangular intervals are used for the angle. This is 
necessary for accounting for the azimuthal dependence 
of volume elements in spherical coordinates. The a pa- 
rameter is a constant that controls how many actual mea- 
surements m! must be observed so that both the actual 
probabilities and the reference would have equal weights. 
As in other studies, we used the value a = l/502i2L2S. 



The Boltzmann device assumes that the known pro- 
tein structures from protein databases (such as PDB) 
correspond to classical equilibrium states. The SC-SC 
potentials can therefore be related to position probabil- 
ity densities f(r) (see Eq. 0, where r can be the radial 
distance or the angle between the side-chains^. In many 
studies, f(r) can be replaced by the normalized pair dis- 
tribution functions g(r) such that 



ir£{r) 



-kT In 



9°- 



(10) 



.9ref(r)_ 

for the distributions depending only on distances. We 
adopt a more general treatment that defines 



-kT In 



Pref (r, 4,0) 



(11) 



2. Residue- Residue Interaction Probability Maps 

Using our proposed definitions of residue-specific lo- 
cal reference frames (introduced in section III Bl) and the 
histogram extraction procedure (described in detail in 
Appendix „), we constructed radial and orientational SC- 
SC interaction probability maps. All the combinations 
of side chain types were investigated. From the normal- 
ized angular histograms, the resulting probability den- 
sity maps for residue-residue interactions were calculated. 
For purpose of illustration, the data shown in all the 
probability density maps presented in this paper were 
obtained by averaging over the entire distance range 
for which the radial and angular histograms were con- 
structed (2 A to 26A), as explained above. 
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all-all H-all 




FIG. 5: Orientational probability density maps for all-all, 
H-all, LH-all and P-all interactions. Here "all", "H", "LH", 
and "P" are virtual residues. The "all" type is obtained by 
averaging over all the observed SC-SC orientations. "H" is ob- 
tained by averaging over the five highly hydrophobic amino 
acids (He, Val, Leu, Phe, Met), "LH" corresponds to the av- 
erage of eight less hydrophobic residues (Ala, Gly, Cys, Trp, 
Tyr, Pro, Thr, Ser), and "P" is obtained by averaging over 
seven highly hydrophilic amino acids (His, Glu, Asn, Gin, 
Asp, Lys and Arg)^^S. The hydropathic properties and ef- 
fects due to the finite size of proteins are clearly reflected in 
these maps. The probability amplitudes correspond to the 
bar scale shown, with units of 10 -3 . 



The orientational probability maps for interactions be- 
tween He (one of the most hydrophobic residues) and Arg 
(highly hydrophilic) are shown in Fig. ^ together with a 
schematic representation of their associated LRFs. The 
LRFs in Fig. 0] are shown in the proximity of the C a , 
to illustrate the computational process of constructing 
them for actual structures. However, for calculations the 
LRFs are translated such that their origin corresponds 
to the geometrical center of the heavy atoms in the side 
chains (i.e. the interaction centers Si). The orienta- 
tional anisotropy is stronger around Arg than around lie 
(Fig. 2J), which is possibly due to the relatively longer 
Arg side chain. Lys, another strong hydrophilic residue, 
presents a relative orientational interaction probability 
map similar to Arg but with an even stronger anisotropic 
character. The similarity between the Ile-Ile and He- Arg 
maps and between the Arg-Ile and Arg- Arg is largely a 
reflection of the fact that the statistical data collected 
here does not depend on the relative orientation of the 
second side chain (SCj) in the reference system of the 
first SC (LRFi). This finding may be a consequence of 
the assumptions behind Eq. |5J 

The most noticeable difference between the maps cal- 
culated in the LRFs of He and Arg is that the relative 
amplitudes of the interaction probabilities at the "poles" 
are reversed. In other words, the most preferred interac- 
tion loci for He is around its "north pole" (i.e. towards 
the positive Oz axis, away from the local backbone) while 
for Arg is in the proximity of its "south pole" . To investi- 
gate if this feature is due to the hydrophobic /hydrophilic 



character, we computed average orientational interaction 
probability maps for three groups of residues: (a) five 
highly hydrophobic (H) residues He, Val, Leu, Phe, Met, 
(b) eight less hydrophobic (LH) residues Ala, Gly, Cys, 
Trp, Tyr, Pro, Thr, Ser, and (c) seven highly hydrophilic 
(P) residues His, Glu, Asn, Gin, Asp, Lys and Arg. 
This classification was suggested by various hydropathy 
scales 39,40 . The computed average maps are presented 
in Fig. [3] together with the maps for the average virtual 
residue "all". 

The "all-all" map shows that, for any residue there is a 
slightly higher probability to find more residues along the 
attached negative Oz axis in their respective LRFs than 
in the "northern hemisphere" . The definition of the LRFs 
was made such that the positive Oz direction points away 
from the nearest backbone atoms. Due to the finite size 
and the relatively compact three-dimensional shapes of 
most proteins analyzed here, we do expect to find less 
residues in that direction for higher SC-SC distances and, 
therefore, at larger distances from the backbone. How- 
ever, the "H-all" map (Fig. |SJ) shows that the highly hy- 
drophobic amino acids present the reverse situation: a 
higher average probability to find other residues at the 
"north pole" of their side chains. This arises because such 
residues are mostly found in the "interior" of proteins, 
"protected" from water. The same arguments apply to 
the observation that, on average, the highly hydrophilic 
residues have preferential coordination loci shown on the 
"P-all" map in the proximity of their "south pole" , as it 
is expected that their positive Oz axis points more of- 
ten toward the water molecules. It is reassuring that the 
average over the less hydrophobic residues, shown in the 
"LH-all" map of Fig. does not reflect the existence of 
significant orientational preferences. 



III. RESULTS: HOW IMPORTANT IS THE 
INFORMATION ON RELATIVE 
RESIDUE-RESIDUE ORIENTATIONS? 

We performed tests using the distance and orientation 
dependent statistical potentials {Udo) as scoring func- 
tions and we assessed their performance on a standard 
database of decoys developed by Samudrala and Levitt. 
The results are compared with respect to the perfor- 
mance of potentials dependent solely on distance (Up). 
Similar decoy tests have been shown to be useful in ana- 
lyzing the ability of various potential schemes to correctly 
recognize the native stat o 17 ' 41 . There are two main as- 
pects that need to be taken into consideration when using 
decoy protein structures for tests: (1) The energy of the 
native state should be as low as possible, and (2) The 
decoy structure that has the lowest root mean square C a 
distance (RMSD) from the native state should also have 
a low energ y 17 i 41 i 42 . For studying these aspects from a 
quantitative point of view, we use the Z-score for both 
the energy (Ze) and the RMSD deviations of the decoy 
structures (Zrmsd)- The definition of these two types 
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Z = (x Q -<x>) / a 
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FIG. 6: The RMSD and energy Z-scores for decoy sets of 
protein structures. The distance-dependent energy (Ud) for 
the set of 654 decoys of Calbindin (3icb) , is plotted as a func- 
tion of the C a RMSD. The Z-scores are proportional to the 
distance of the corresponding parameter of interest (depicted 
as interrupted lines) from the mean values of their distribu- 
tions (solid lines), in units of standard deviations a. 



of Z-scores is illustrated in Fig. [5] where the distance- 
dependent energies (Ud) that we computed for the set of 
654 decoys of Calbindin (3icb), is plotted as a function 
of the C a RMSD, calculated with respect to the native 
state. The Z-score of a statistical quantity x is 



Z = 



(13) 



where a is the standard deviation and x is the mean of 
the distribution of x values. In our case, we use x both 
as the energy and the RMSD of the set of decoys. In 
all cases, a good scoring function will lead to a nega- 
tive Z-score. In Fig. El we show the radial energy term 
computed using our method for the Calbidin decoy set 
(3icb in2&). The high similarity between this plot and 
the energies depicted in Fig. 1 of Samudrala and Levitt^ 
confirms that the present radial potentials are essentially 
the same as those used by other groups. In this figure, 
the circle shows the position of the native state and the 
diamond shows the position of the decoy structure that 
has the lowest energy. For a potential (or scoring func- 
tion) that is efficient in discriminating the native state 
of a protein from a set of decoys it is expected that (1) 
the native state (circle) corresponds to the lowest inter- 
action energy, and (2) the decoy with the lowest energy 
(diamond) should have as small an RMSD as possible. 
Both criteria are important and we find that both the 
Ze and Zrmsd scores defined here are useful, quantita- 
tive measures of the performance of the potential energy 
function. These Z-scores are proportional to the distance 
of the corresponding parameter of interest (depicted in 
Fig. as interrupted lines) from the mean values of their 
distributions (solid lines) , in units of standard deviations 



Due to the large number and diversity of decoy sets 
that are employed in this paper, we do not repeat the 
specific description of the methods used for generating 
each decoy set and of their names (e.g. "single" , "lmds" , 
"fisa", etc.). That information is provided by the "De- 
coys 'R' Us" database (http://dd.stanford.edu) and by 
the corresponding publications^. Details are provided 
for the decoys that are relevant to the results obtained 
for our tests. 

In Fig. [7| are compared total statistical potential val- 
ues obtained for distance-only dependent potentials (Ud, 
squares), and for distance and orientation dependent po- 
tentials (Udo, circles) for decoy structures in the "single" 
sets ( "misfold" and "pdb_error" ) of the "Decoys 'R' Us" 
database^. 

The "misfold" set contains, for each native protein, 
an alternative structure generated by placing the same 
sequences on different folds with the same number of 
residues (e.g. the "lbp2on2paz" notation means that the 
"lbp2" sequence has been placed on the "2paz" fold2£). 
The incorrect structures in the "pdb_error" set are exper- 
imental structures that have been substantially re-refined 
or found to contain errors. The points united by dotted 
lines (filled symbols) correspond to incorrect structures, 
while the continuous lines (open symbols) join scores ob- 
tained for correct states (i.e. "native" for the "misfold" 
set, and "latest" for the "pdb_error" set). The arrow in 
Fig. emphasizes the case where the Ud score fails to 
identify the native state, while Udo succeeds. Overall, 
the "single" decoy sets proved to be relative easy tests for 
both the Ud and U do scores. The orientation dependent 
Udo scores consistently result in better (i.e. smaller) val- 
ues for the correct protein structures in all cases studied. 

In all the tests performed, some of the decoy sets that 
were not appropriate for analysis were eliminated. For 
example, non-standard side chains were present in the 
decoys or in the native structures, for which no U do po- 
tentials were constructed. Another reason for eliminating 
a few structures was the absence of enough heavy atom 
coordinates in some of their large side chains. This pre- 
vented the construction of LRFs for those side chain and, 
therefore, the correct estimation of their Udo potentials. 

Much more challenging decoy tests than the ones pre- 
sented above for the "single" decoys can be performed 
using the "multiple" decoy sets from the "Decoys 'R' Us" 
database^ 8 . The main goal of these tests is to identify the 
native structure from a set of many (i.e. tens, hundreds, 
or even thousands) of decoys. We discuss next the re- 
sults of these tests using the Ze and C Q RMSD Z-scores 
defined as shown in Fig. From the "multiple" decoy 
sets available in the "Decoys 'R' Us" database, the "lat- 
tice" set was eliminated from the analysis presented here 
because the new Udo statistical potentials used in this 
paper were constructed using a training set of real, non- 
homologous protein structures. The specific constraints 
imposed by the tetrahedral lattice topology on side chain 
conformations for the "lattice" decoys affect more nega- 
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FIG. 7: Results of tests on the "single" decoy sets»& that consist of pairs of correct (i.e. native) and incorrect structures. 
The plot compares the total statistical potential values obtained for distance-only dependent potentials (Ud, squares), and for 
distance and orientation dependent potentials (Udo, circles). The points united by dotted lines (filled symbols) correspond to 
incorrect structures, while the continuous lines join scores obtained for correct states (open symbols). The arrow emphasizes 
a case where the Ud score fails to identify the native state, while Udo succeeds. These types of tests, concerned with 
discriminating the native state from a single decoy, are relatively easy and both the Ud and Udo scores succeed to identify the 
"correct" structures in a majority of cases. 



tively the performance of the U do potentials than the 
performance of the distance-only dependent scores, Ud- 
In Fig. IS] are shown energy (Z E ) and C a RMSD Z- 
scores (Zrmsd) calculated for the multiple decoy sets 
"lmds" , "fisa_casp3" , "fisa" and "4state"^&. Z-scores cal- 
culated for the distance-only dependent statistical po- 
tentials Ud (dark bars) and for the new Udo potential 
(white bars) values are compared. In Fig. [S] and in all 
the subsequent figures presenting Z-scores, more nega- 
tive values are better, and the cases in which Udo gives 
better results than using Ud are emphasized using ar- 
rows. It is observed that in a large number of cases the 
inclusion of oricntational information improves the per- 
formance of both Ze and Zrmsd scores. An especially 
interesting case is the one of the "lmds" set in which 
all-atom distance dependent scores were shown to per- 
form poorlySSi^. In this case, when both the Ze and 
Zrmsd Z-scores are considered, the new distance- and 
orientation-dependent potentials Udo performed much 
better than the distance-only dependent Ud in a major- 
ity of cases. 

In Figs. 1911 II are shown the corresponding results for 
the "hg_structal" , "ig_structal_hires" and "ig_structal" 
decoy sets of the "Decoys 'R' Us" database. For the 



"hg_structal" decoy sets of globins, improvements are 
observed in about a third of the test cases. However, 
for both the "ig_structal_hires" and "ig_structal" decoy 
sets of immunoglobulins, more than 70% of tests present 
an improvement in the performance the energy Z-scores 
when Udo potentials are employed (Ze(Udo))- 

The results for all the sets of "multiple" decoy tests 
are summarized in Table [H] As mentioned above, 
particularly strong improvements are observed when 
using Udo for the "lmds" set, for which distance- 
dependent scoring functions were reported before to have 
a weak performanc o 28 i 43 . The very good energy Z-scores 
(Z E (U D o) < Z E (U D ) for more than 70% of the de- 
coy sets) that were obtained for the large decoy sets of 
immunoglobulins ( "ig_structal" and "ig_structal_hires" ) 
suggest that considering the orientational dependence 
(Udo) in the cases of proteins with a significant /3-sheet 
architecture it is more important than for proteins that 
are mainly a-helical (e.g. the "hg_structal" sets). At the 
same time, the fact that U do improves both Z-scores in 
about a third of the "hg_structal" decoy tests, could be 
explained by the globular shape of these predominantly 
a-helical structures. These results agree to the rather 
intuitive observation that details about the relative side 



11 



Imds 



fisa_casp3 



fisa 



4state 




4pti (344) 
2ovo (348) 
2cro (501) 
1shf (438 
1 igd 501 
1fc2 (501) 
1dtk (216) 
1ctf (498) 
1bba (501) 
1 bOn (498) 

1iwe (1408 
1eh2 (2414' 
1 blO (972) 
1bg8 (1201; 

4icb (501) 
2cro (501) 
1hdd (501) 
1fc2 (501) 

4rxn (678) 
4pti (688) 
3icb (654) 
2cro (675) 
1sn3 (661) 
1 r69 (676) 
1ctf (631) 



Imds 



fisa_casp3 



fisa 



4state 




"RMSD 



FIG. 8: The energy (Ze) and C a RMSD Z-scores (Zrmsd) calculated for the multiple decoy sets "Imds", "fisa_casp3" , "fisa" 
and "4state"— The numbers in brackets represent the number of decoys in each set, including the native structure. Z-scores 
calculated using statistical potentials dependent only on distance Ud (dark bars) and distance and orientation dependent 
potentials (Udo, white) are compared. More negative Z-scores are better, and the cases in which Udo gives better results 
than using Ud are emphasized by the arrows on the left. When both Ze and Zrmsd Z-scores are considered, the inclusion of 
orientational information improves the performance in a majority of cases. 



chain orientations could play a more significant role in 
compact, globular proteins than in proteins with rela- 
tively extended chains. Overall, it is shown that, for a 
majority of test cases, the Ze and the Zrmsd Z-scores 
are improved by including the relative SC-SC orienta- 
tions in the structure analysis. This demonstrates that 
the performance of the statistical potentials in discrim- 
inating the native state is significantly enhanced by the 
inclusion of orientation dependent information. 

The results of the Z-score calculations for the de- 
coy sets analyzed in this paper show that there are 
cases where the more detailed, distance- and orientation- 
dependent potentials (Udo) do not provide a better per- 
formance in discriminating the native state as compared 
to using simple distance-dependent potentials (Udo)- 
Since both the Udo an d U d potentials employed here 
were extracted from the same protein database and for 
the same number of radial bins, it is expected that the 
Udo values, while more detailed, are at the same time 
more prone to statistical errors than the Ud potentials. 
The sparse data correction procedure ensures that rea- 
sonable values are employed for situations when little or 
no specific statistical data on relative residue-residue dis- 
tances and orientations is available. However, this pro- 
cedure alone can only ensure that realistic, accurate sta- 



TABLE II: Results of tests for the "multiple" decoy sets*&. 



The "multiple" 


Z e (Udo) 


Zrmsd(Udo) 


decoy sets™ 


< Z E (U D ) 


< Zrmsd(Ud) 


Imds 


50.0% 


80.0% 


fisa_casp3 


25.0% 


100.0% 


fisa 


50.0% 


50.0% 


4state 


28.6% 


57.1% 


hgjstructal 


34.6% 


26.9% 


ig_structal_hires 


78.6% 


28.6% 


ig_structal 


71.4% 


28.6% 



"The Udo potentials were not "trained" for the lattice topology 
and, therefore, the "lattice" set2& was excluded from these tests 
(see text). 



tistical potentials are obtained when a very large train- 
ing database of non-homologue protein structures is em- 
ployed. The relatively reduced size of the database of 
structures employed here (and inSS) for extracting the 
statistical data could be an important factor responsible 
for the reduced U do performance in some cases. 

The type of protein architecture and the number of 
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FIG. 9: Energy (Ze) and C a RMSD Z-scores (Zrmsd) calculated for the multiple decoy sets "hg_structal'^. The numbers 
in brackets represent the number of decoys in each set, including the native structure. Z-scores calculated using statistical 
potentials dependent only on distance (Ud, dark bars) and distance and orientation dependent potentials (Udo, white bars) 
are compared. More negative Z-scores are better. The cases in which Udo gives better results than using Ud are emphasized 
by the arrows on the left. 



residues are also factors that could influence the relative 
performance of the Udo and Ud potentials. A useful 
quantitative parameter that is directly related to the type 
of protein architecture is the contact order (CO)^^^& 
of a protein structure. The CO definition employed here 
is 

co = ? E AS ^ ( 14 ) 

c <i,j> 

where N c is the total number of contacts and AiSy is 
the sequence separation, in residues, between contacting 
residues i and j. Two residues are considered to be in 
contact when \i — j\ > 1 and any of the heavy atoms 
of residue i is within 3.75 A of any of the heavy atoms 
of residue ja£. When the CO is small the protein struc- 
ture presents mostly local contacts (i.e. \i — j\ < 6), and 
when CO is larger, the contacts are nonlocal^^. As 
shown in Fig. 112b we observed an inverse proportional- 
ity between the CO values calculated for all the native 
states of the decoys analyzed in this paper and the frac- 
tion of local contacts for various protein architectures. 
Based on this observation, we investigated the relative 
performance of the Udo and U d potentials as a func- 
tion of CO values (Fig. 112b ) and sequence length (Fig. 



1121). In Figs. 112b and 112b is shown the dependence of 
the difference AZe = Ze(Udo) ~ Ze(Ud) for the en- 
ergy Z-scores on contact order values (CO) and sequence 
length of the native state (N). Negative AZe values cor- 
respond to better performing scores for the distance- and 
orientation-dependent potentials (Udo)- The lines rep- 
resent linear fits. While the trends are not very strong, it 
is observed that, for the energy Z-scores the novel Udo 
statistical potentials perform better then the U d poten- 
tials for longer proteins, presenting large contact orders, 
such as the /3-sheet structures from the "ig_structal" and 
"ig_structal_hires" sets. It is observed that the inclu- 
sion of the information on relative SC-SC orientations 
makes the current version of the Udo statistical poten- 
tials more useful than Ud for relatively large, globular 
proteins (N > 100 residues), with a significant content 
of /3-sheet structure and nonlocal contacts. At the same 
time, it is also observed that even for relatively short 
(N < 100 residues) and a-helical proteins (low CO val- 
ues), both the Ze and Zrm § d -scores can be improved 
when using U do potentials but for a smaller number of 
cases. These results suggest that details about side chain 
- backbone interactions should be included in statistical 
potentials for short or a-helical proteins with a high con- 
tent of local contacts. 
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FIG. 10: Energy (Ze) and C a RMSD Z-scores (Zrmsd) calculated for the multiple decoy sets "ig_structal" (I s * part) and 
"ig_structal_hires" 28 '. The numbers in brackets represent the number of decoys in each set, including the native structure. Z- 
scores calculated using statistical potentials dependent only on distance (Ud, dark bars) and distance and orientation dependent 
potentials (Udo, white bars) are compared. More negative Z-scores are better. The cases in which Udo gives better results 
than using Ud are emphasized by the arrows on the left. 



IV. CONCLUSIONS 

Our results demonstrate that it is possible to im- 
prove the performance of energy based scoring functions 
by using statistical information extracted from the rela- 
tive residue-residue orientations. Calculations of orien- 
tational order parameters for investigating the residue 
packing in proteins showed that architecture dependent 
packing is best described by a linear combination of sim- 
ple cluster geometries (fee, icos, hep and bcc). This result 
reinforces the need for extracting orientation dependent 
potentials using PDB structures. We have defined a novel 
local reference system for each residue for quantitatively 
describing the relative three-dimensional orientations in 
a manner that is independent of the neighboring amino 
acids. Arguments based on experimental resolution of 
protein structures were used to derive the optimal bin 
size employed for collecting statistical data with both 
angular and distance dependence. The orientational in- 
formation has both common and complementary signif- 
icance as compared to the information that can be ex- 
tracted from the relative residue-residue distances alone. 
The tests that we performed on a standard data base of 
artificially generated decoy structures suggest that this 
complementarity can be very important for a large class 



of protein structures. The novel distance and orientation 
dependent statistical potentials were shown to present 
an enhanced ability to recognize native-like protein folds, 
especially for larger structures with high contact orders 
(e.g. immunoglobulins). They should find use in con- 
structing the next generation of coarse grained off-lattice 
protein simulations. These new potentials could also be 
instrumental in developing more efficient computational 
methods for structure prediction on much larger scales 
than it is currently possible, addressing one of the ma- 
jor goals of proteomics. To achieve this it will be impor- 
tant to also include the relative orientations between side 
chains and the protein backbone^. 
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FIG. 11: Energy (Z E ) and C a RMSD Z-scores (Zrmsd) calculated for the multiple decoy sets "igjstructal" (2 nd part)^£. 
The numbers in brackets represent the number of decoys in each set, including the native structure. Z-scores calculated using 
statistical potentials dependent only on distance (Ud, dark bars) and distance and orientation dependent potentials (Udo, 
white bars) are compared. More negative Z-scores are better. The cases in which Udo gives better results than using Ud 
are emphasized by the arrows on the left. It is observed that in a majority of cases the inclusion of orientational information 
improves the performance of both Ze and Zrmsd scores. 



VMD 48 and Matlab (The Mathworks, Inc., Natick, MA). 



APPENDIX: HISTOGRAM EXTRACTION 

Previous studies suggest that only protein structures 
that are determined with a resolution of 2A or better 
should be used in the computation of g v (r)22i from pro- 
tein databases 3 . A resolution of 2A for protein struc- 
tures that are determined by X-ray crystallography corre- 
sponds to an accuracy of +/- 0.2 A in atomic positions^. 
The 2A resolution is often good enough to accurately 
assign hydrogen bonds and to allow for a limited inter- 
pretation of solvent structure. 

The high resolution structural data are binned for the 
construction of either radial or orientation dependent 
pair distributions g lJ (r) and g lJ (4>,6). It is important, 
therefore, to identify the minimal radial and angular bin 
sizes that ensure a certain high level of confidence that 
the information has been analyzed correctly and that no 
important artifacts due to the limitations of the exper- 
imental methods have been overlooked. In building a 
statistical distribution, the data that is collected from 
protein structures^ consists of inter-atomic distances. We 



expect that such data has an accuracy of dR = +/ — 0.4A. 

For the radial dependence of the data, let the bin size 
be L. In the process of assigning a certain pair distance to 
a certain bin, if that value is exactly at the bin boundary, 
we have a probability of correct assignment P ca = 0.5. 
On the other hand, if the value is exactly in the mid- 
dle of the bin, this probability is maximum (if L > 2dR 
than P ca = 1). We estimate that the probability of cor- 
rect radial assignment depends on the bin size L and the 
accuracy dR as 

,L/2 

P(L,dR) = 2/L / P ca (x,dR)dx (A.l) 
Jo 

This estimate is based on the assumption that the prob- 
ability of correct assignment to a certain bin of a certain 
pair distance value is directly proportional to the differ- 
ence x between that value and the nearest bin boundary, 
or that P ca (x,dR) = a + bx. Using the above assumed 
conditions for the values of P ca at the bin middle and on 
the boundary, we find that 

P ca (x,dR) = (1 + x/dR)/2 if x < dR 
P ca (x,dR) = 1 if x>dR (A.2) 
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FIG. 12: (a) The dependence of the percent of local contacts (LC) on the contact order (CO) of the native structures for all 
the decoy sets analyzed in this paper, (b) The difference AZe = Ze(Udo) — Ze(Ud) for the energy Z-scores plotted versus 
the CO of the native state for each decoy set. (c) AZe plotted versus the number of residues (N) for each decoy set. Negative 
AZe values correspond to better performing scores for the distance- and orientation-dependent potentials (Udo)- The lines, 
representing linear fits of AZe, are shown for emphasizing the trends. 



Therefore, for the average probability of correct assign- 
ment we find that 

P(L, dR) = 1 if L > 2 dR and 

P(L,dR) = 0.5 + — — if L < 2 dR. (A.3) 
8 dR 

From this result we can infer that for dR = 0.4A, if we 
choose a radial bin size L = 0.5 A, the confidence of cor- 
rect bin assignment is about 65.6%. In this work we 
employ a radial bin width of L ~ 1.2 A. This bin width is 
at least twice larger than the values employed by other 
authors. However, it ensures that we have a confidence 
level of at least 83.3% that the radial bin assignment is 
correct^ 



Similar considerations are used for studying the an- 
gular orientation pair distributions. Consider the polar 
coordinate system and the division of the and </> values. 
We assume that we have N bins in 9 and 2N bins in 
<p. The solid angle corresponding to one angular bin is 
dVL = 2tt/N 2 . This solid angle is proportional to the ra- 
tio between the area element that insures a certain high 
value for the confidence level of correct assignment, and 
the square of the corresponding minimum pair distance 
(radius) at which we expect that confidence level. Let us 
require a confidence level of 80% at a distance of 5A. At 
larger distances the confidence level will be higher, but it 
will decrease at closer ranges. Since d£l « L 2 / (R + dR) 2 , 
and L = 1.2k, R = 5.0A and dR = 0.4A, we have 
dfl w 0.22 2 = 2tt/N 2 . This reasoning leads us to em- 
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ploy the nearest even value of N = 12 for the number 
of bins that divide the angular interval for 6. Previous 
studies 19 used a smaller value (N = 3). 

The use of a larger radial bin size (1.2A vs. 0.5A) 
than previously suggested, is needed to ensure a high 
confidence level (at least 83.3%) of correct radial bin as- 
signment. Here, we employ twenty radial distance bins 
with L = 1.2A for distances starting at 2 A. When we 
collect statistical data on the relative 3-dimensional ori- 
entation of pairs of residues, we obtain a similarly high 



confidence level by using N = 12 for 6 and N = 24 for <f>. 

All the calculations presented in this section are based 
on the assumption that all protein structures analyzed 
have a resolution of 2A or better. If proteins of differ- 
ent structural resolution are analyzed, the optimal bin 
size can be estimated accordingly. These arguments as- 
sure that optimal radial and angular bin sizes (i.e. values 
that are small enough to provide a good resolution, yet 
large enough to correspond to a high statistical confi- 
dence level) are employed. 
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